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Abstract 

Holographic optical tweezers allow the three dimensional, dynamic, multi- 
point manipulation of micron sized objects using laser light. Exploiting the 
massive parallel architecture of modern GPUs we can generate highly opti- 
mized holograms at video frame rate allowing the precise interactive micro- 
manipulation of 3D structures. 
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1. Introduction 

Holographic optical tweezers (HOT) use light to manipulate matter at the 
micron scale PQ. Dielectric objects, whose refractive index is higher than the 
surrounding medium can be trapped in regions of high light intensity by elec- 
tromagnetic forces arising from the scattering of light 2J. To achieve stable 
trapping in three dimensions, light has to be strongly focused using a micro- 
scope objective with high numerical aperture. Many objects can be trapped 
simultaneously if more than a single focal spot is generated around the objec- 
tive's focal plane. Digital holography provides a way to achieve this by applying 
a computer generated phase mask to a laser beam before it is sent through the 
microscope objective. The commercial availability of Spatial Light Modulators 
(SLM) has made this task easier by providing a reconfigurable support for com- 
puter generated holograms which is connected to a PC through the video output 
(usually DVI) on a standard video card [HI IH H] ■ The task of finding a phase 
mask that efficiently redistributes the available laser power among an array of 
target focal spots is not a straightforward one. Phase only modulation can easily 
give rise to unwanted focal spots ("ghost traps") or large intensity variations. 
We recently proposed an iterative procedure that achieves optimal efficiency 
and uniformity in a few tens of steps [TJ. However the resulting computational 
load is so high that the use of optimized algorithms for dynamic manipulation is 
limited to those circumstances when the sequence of moves is known in advance 
and holograms can be then pre-calculated. Such a slowness is often considered 
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as one of the major factors for preferring scanning beam techniques [5] over 
digital holography for real-time applications. 

In this paper we demonstrate that CUDA |i9i enabled GPUs can generate 
highly optimized holograms at a frame-rate that is fast enough to allow inter- 
active micro-manipulation using strong and uniform trap arrays. 

2. GPU device architecture 

Graphic Processing Units (GPU) have brought the power of parallel calculus 
to personal computers. The possibility of using a personal computer to easily 
and cheaply achieve the performances of an expensive CPU cluster is revolu- 
tionizing computational physics in a wide range of fields including molecular 
dynamics [TO] , Monte Carlo simulations [TT] , finite element analysis [T2] , lattice 
QCD [13]. The Compute Unified Device Architecture (CUDA) is a general pur- 
pose parallel computing architecture introduced by NVIDIA. CUDA provides a 
parallel programming model and software environment allowing to exploit the 
massive parallel architecture of modern Graphic Processing Units (GPU) for 
non-graphics applications. General purpose parallel algorithms can be imple- 
mented on a CUDA enabled GPU using a small set of C extensions provided 
by the CUDA SDK. The CUDA programming model closely reflects the GPU 
hardware architecture. A CUDA enabled GPU is composed of a global memory 
and a variable number of multiprocessors. Each multiprocessor includes eight 
scalar processor cores, two special function units, 8192 registers, a multithreaded 
instruction unit and one on-chip shared memory. As a result hundreds of cores 
can collectively run thousands of computing threads that can share data without 
sending it over the system memory bus. Threads are arranged in a grid of blocks 
and each block is assigned to a multiprocessor. In this way threads that belong 
to the same block can be synchronized and can cooperate using shared memory. 
Within a block, threads are arranged in groups of 32 called warps, threads in a 
warp are physically executed in parallel and are synchronized. Multiprocessors 
can only execute one warp at time, however if threads in a warp are waiting 
to access global memory the multiprocessor can stop executing that warp and 
switch to another one eliminating memory latency time. 

Such an execution model requires specific optimization strategies that, for 
the purpose of the present work, can be summarized in three general rules: 

Rule A) Keep multiprocessors busy and hide memory latency. 
To this aim one should: 

1. Group threads in a number of blocks that is multiple of the number 
of multiprocessors. 

2. Choose the number of threads per block as a multiple of 32 to avoid 
wasting time with unfilled warps. 

3. Maximize the number of active warps by using many threads per 
block. 
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4. When possible, avoid using conditional instructions that serialize the 
execution of a warp. 

Rule B) Minimize read/write operations on global memory. 

Writing and reading global memory is very slow and sometimes it can be 
even better to recalculate than to cache data. Shared memory must be 
used whenever it can reduce the access to global memory. Shared memory 
is hundreds of times faster than global memory but only 16k are currently 
available to any multiprocessor. 

Rule C) Access global memory with coalesced calls. 

When all threads in a half warp execute a read/write instruction, the hard- 
ware detects whether threads access consecutive global memory locations 
and coalesces all these accesses. 



3. Optimized algorithms for holographic trapping 

In back focal plane phase modulation we use an SLM to apply an array 
of phase shifts to a plane wave at the back focal plane of a focusing optical 
system (Fig. [I]). Our task here is to calculate the best phase mask so that the 
modulated wavefront propagating through the optical system is focused onto an 
array of chosen target spots. Given the phase shift on each pixel <pj the complex 
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Figure 1: Schematic representation of Fourier optics propagation from SLM plane (back focal 
plane) to the front focal plane of the optical system. 

field on a target point m, with coordinates x m ,y m , z mi is given by [7]: 

= i £ e^T) (i) 



Where N is the number of pixels, i is the imaginary unit and A™ 1 is the phase 
acquired upon propagation: 
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Figure 2: Flowchart representing K iterations of GSW algorithm. 



A T = JJf^ + Vf) + ^(xjx™ + yj y m ) (2) 

where / is the effective focal length of the focusing optics (L3, L4, MO in 
Fig. [(jj), A is the laser wavelength and Xj, yj are the coordinates of the j th pixel. 
If we want to send all the light through a single point m — 1 then we should 
set (f>j = Aj, so that V\ = 1. When considering multiple traps, a phase only 
modulation might not be able to split all the available power uniformly among 
the target points. For each pixel we now have the multiple choices A™ (the 
single trap holograms) and finding a compromise could seem a hopeless task. A 
first, reasonably fast recipe is that of taking the complex superposition of single 
trap holograms [14"] : 

^•=arg e^T+^) (3) 

m=l,M 

Where <fij is again the phase of the j th SLM's pixel, m th is the trap index, M 
is the number of traps, 9 m is a random phase relative to the m th trap. Such 
a procedure, usually referred as the random superposition algorithm (SR), is 
computationally rather fast but usually results in ghost traps and poor unifor- 
mities, especially when dealing with ordered structures. A quantitative measure 
of the hologram performance can be obtained by defining an efficiency (e) and 
a uniformity (u) parameters as a function of the fractions of total power flowing 
through the m th trap I m = \V m \ 2 : 



_ max[/ m ] - min[/ m ] 
max[/ m ] + min[/ m ] 
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A poor performance may result in particles getting trapped in unwanted ghost 
trap sites or bead escape from temporary low intensity traps. When such events 
are acceptable SR provides a good choice for real time manipulation, but if a 
higher degree of control is required a more performing algorithm is needed. A 
good candidate is the GSW algorithm (weighted Gerchberg-Saxton [7]) which 
gives excellent results in terms of efficiency and uniformity. The basic idea 
behind GSW is that, if aiming at uniform trap intensities with SR leads to 
nonuniformities, we may hope that there's a choice of non uniform target traps' 
intensities resulting in an evenly spread trapping light. GSW allows to calcu- 
late such non uniform weights w m by the iterative procedure illustrated in the 
flowchart reported in Fig. [2j Angle brackets in the update weights box of Fig. 
[2] represent averaging over the trap index m. After a few tens of iterations 
the procedure converges to almost perfectly uniform trap intensities so that 
\V m \ ~ (|U TO |) and the weights w m don't get updated anymore. 

4. Implementing HOT algorithms on a CUDA device 
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Figure 3: Schematic diagram of the kernel BackPropKer calculating the phase shifts on the 
SLM by a backward propagation of light emerging from the target spots. 

The parallel architecture of GPUs is particularly suited for digital hologra- 
phy, whose basic task is that of performing complex algebra over a large array of 
independent pixels. In the field of digital holography GPUs have been used for 
real-time holographic microscopy \T5\ [TB] or holographic displays [T7J [T5] . In the 
field of optical trapping, the possibility of generating holograms with real-time 
frame rate is very attractive for interactive applications. Early attempts always 
suffered the slowness of CPU resulting in either slow or low efficiency holograms 
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P^l |2"D] . More recently, custom shading programs running on the GPU have 
been used to achieve a considerable speedup in hologram generation, although 
always being limited to quick and poorly performing algorithms |21| 122] . The 
CUDA architecture makes it a lot easier to implement more complex algorithms 
in a general purpose environment which is not limited to graphic applications. 
When using a CUDA enabled video card, results can be also computed directly 
on the frame buffer avoiding useless memory transfers. 
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Table 1: Comparison table summarizing time costs of SR algorithm and the relative impor- 
tance of optimization rules. The GPU is a GeForce GTX 260 and speedup data are evaluated 
with respect to a Pentium D 3.2 GHz. 

Both of the previously discussed algorithms require the common step of 
backward light propagation from the M target traps back to the N SLM pixels. 
In our case the SLM is placed in the back Fourier plane of the optical system so 
backward propagation is obtained by Eq. [3j As shown in Fig [3] the procedure 
can be translated into a kernel having as input arguments the full trap structure 
described by the M coordinates, weights and phases: (x m , y m , z m ), w m , 9 m . SR 
holograms are obtained by putting w m — 1 and choosing 9 rn as random phases. 
We implemented such a procedure in the single kernel BackPropKer having a 
number of threads equal to the number of SLM pixels. Each thread evaluates a 
single phase modulation <f)j and stores it in a linear array residing in the global 
memory. According to rule C in section [2j it is important that contiguous 
threads write on contiguous pixels phase data so that coalesced memory access 
is guaranteed. As discussed in rule A in section [2] we use blocks containing 
a number of threads that is large and multiple of 32. Each thread needs to 
access the full trap structure so that a significant speedup can be achieved by 
preloading the trap data in the shared memory as prescribed by rule B. In 
each block only M threads cooperate to read the traps' data. At this point 
we are ready to evaluate the time performance of BackPropKer in generating 
holograms using the SR algorithm. To this aim we first generate M random 
phases 9 m on the CPU and than store the trap structure on the global memory. 
Using a GeForce GTX 260 we can generate 768x768 SR holograms 350 times 
faster than using a Pentium D 3.2 GHz. The time spent by SR to compute a 
hologram grows linearly with the number of traps with a time per trap coefficient 
of 0.35 ms/trap. As an illustration of the relative importance of the considered 
optimization rules, we compare in Table 1 the most efficient kernel, where all 
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Figure 4: Schematic diagram of the kernel SumRedKer. For the sake of simplicity we only 
show a single block of four threads. In the first step each thread sums up two elements of the 
global memory and stores the result on shared memory. In the following steps the number of 
active threads and the terms to sum are halved until a block is left with one single term. The 
i th block will write its partial sum on the i th address the global memory array. 



this rules are obeyed, to partially optimized kernels. 

Turning now to the better performing GSW algorithm, in addition to a back 
propagation kernel we need a procedure to forward propagate the fields from 
SLM pixels to target traps (Eq. [IJ. Such a procedure can be decomposed into 
two main tasks: i) calculate the contribution of each pixel to the complex field 
V m on the m th trap's location, ii) sum up all contribution to obtain V m . The 
second task is a very common one and it's widely discussed in the CUDA SDK 
examples. This procedure is based on the sum reduction kernel SumRedKer 
that performs partial sums, reducing the number of terms. A loop iterates 
SumRedKer until one single term is left containing the sum of all elements. A 
schematic representation of SumRedKer is reported in Fig. [3]where a single block 
is shown. Each block contains blocksize threads that perform the partial sum 
of 2xblocksize elements and writes the result back to the global memory. At 
the end of SumRedKer a number of terms equal to the number of used blocks 
still remains to be summed. Therefore a sequence of log(iV)/ log(blocksize) 
kernels is needed to perform the whole sum. 

The evaluation of Eq. [T] also requires the task of calculating the contribu- 
tion of the field radiating from each pixel to the total trap field V m . Such a 
contribution is obtained calculating the phase shifts A™ in Eq. [2] and building 

the complex exponentials e t ^ j ~ A j \ In order to reduce read/write operations 
on global memory we use a slightly different version of the sum reduction kernel 
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Figure 5: GSW computational time. The computational time per trap per iteration is 
reported as a function of the number of target points. Total time (black line) is the sum of 
the time required for propagation (green line) plus the pixel phase evaluation (red line). 

as the first partial sum step. The first SumRedKer will begin having the phase 
modulations cj>j on the global memory locations gj in Fig. [4] so that we need 
to calculate complex exponential before the first write on shared memory (i.e. 
si = e 4 ^ 1 - A ") + e *(<fe-A™)). The phases A™ 1 (M*N in total) are needed both 
for forward and backward propagation routines. Observing that such phases 
are fixed for a chosen trap geometry, one could think that precaching them in 
global memory could save computational time. However we checked that direct 
calculation is always faster (see rule B). Once V m s are known, the calculation of 
the weights w m is quick and straightforward. The time required by GSW grows 
almost linearly with the number of traps or iterations. In Fig. [5] we report the 
computational time per trap per iteration as a function of traps number. Devia- 
tions from linearity are observed for small traps number evidencing the presence 
of a time cost which is essentially independent from the number of traps and 
is probably due to memory read/write operations. As we can see from the fig- 
ure, we can neglect the small deviations from linearity and define a time per 
trap per iteration. Using a GeForce GTX 260 we obtain 0.44 ms/trap/iteration 
obtaining a 45x speedup respect to a Pentium D 3.2 GHz. 

5. Real-time manipulation 

Our optical tweezers are based upon a Nikon TE2000U inverted microscope 
with a lOOx objective lens, NA 1.4. To form the trap we use a Nd:YAG laser, 
frequency-doubled to give a maximum power of 3 W at 532 nm (LaserQuantum 
Opus). After expansion and collimation, the beam from this laser is reflected 
off a computer-controlled SLM (HoloEye LCR 2500). Our SLM is based on a 
liquid crystal reflective micro-display. A laser beam reflecting off the SLM will 
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Figure 6: Schematic view of the experimental setup for holographic optical trapping. 
L1,L2,L3,L4: planoconvex achromats, DM: dichroic mirror, BS: beam splitter, MO: micro- 
scope objective, WP: half waveplatc. 

emerge with a phase retardation that can be modulated on a pixel by pixel 
basis. Phase modulation is achieved by electrically addressing the pixels and 
therefore locally reorienting the nematic axis of liquid crystal molecules. When 
a grayscale, 8bit depth image is displayed on the SLM, a proper pattern of 
voltages is relayed to the pixels so that each grayscale value is linearly mapped 
to a phase shift ranging from to 27r. Light reflected off the SLM is then coupled 
to the microscope by projecting a demagnificd image of the SLM plane on the 
back focal plane of the microscope objective. An array of optical traps is then 
produced around the front focal plane of the objective located in a colloidal 
water suspension above the coverslip. The SLM was controlled by a host PC 
equipped with a NVidia GeForce GTX 260 video card. User input is managed 
by a GUI mainloop thread (Tkinter) running in a Python shell while a Python 
module wraps the CUDA library functions providing a high level interface to 
the GPU hologram generation. 

As a demonstration of real-time manipulation using optimized GPU gener- 
ated holograms, we show the simultaneous trapping and manipulation of eight 
silica beads (2/im diameter) in water. Optimized holograms are obtained with 
5 GSW iterations at a rate of 48 Hz following user input. Fig. [H] shows three 
frames from the corresponding SLM and CCD timelines. While a hologram 
movie is displayed on the SLM (lower timeline) based on user input, a dynamic 
3D micro-hologram, consisting of an array of moving bright light spots, is pro- 
jected in the sample volume providing dynamical, and real-time reconfigurable 
optical traps. Trapped beads are imaged with bright light illumination on a 
CCD camera (upper timeline). The actual frame-rate is slightly lowered due 
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Figure 7: GSW performance. We report the efficiency (e) and uniformity (u) for GSW 
generated holograms as a function of the number of traps. The number of GSW iterations is 
always such to work at a fixed framerate of 20 Hz. Holograms with a performance above 90% 
can be generated at 20Hz for trap arrays as large as 16. 

to time lost in copying from device memory to host memory and then back to 
the video card output where the SLM is attached. This further delay could 
be avoided exploiting CUDA-OpcnGL interoperability. In this way holograms 
could be calculated directly on the frame buffer and displayed on the SLM 
without passing through the host. Ultimately the frame-rate is limited by SLM 
response time, which, for liquid crystal based devices, is typically about 20 Hz. 
Such a frame-rate allows to perform a large enough number of GSW iterations 
to generate large arrays of traps with a high efficiency and uniformity. In FigjT] 
we report the efficiency (e) and uniformity (it) for GSW generated holograms 
as a function of the number of traps arranged in a 2D square grid. The num- 
ber of GSW iterations is always such that holograms are generated at a fixed 
framerate of 20 Hz. Though efficiency is never lower than 85% uniformity falls 
down to 0.36 for a 9x9 grid where only one GSW iterations is allowed in order to 
keep the frame-rate at 20 Hz. We note here that even one single GSW iteration 
results in a significant improvement in performance over SR which would give 
an efficiency of 70% and a uniformity of only a few percents. 

In conclusion, we have used a CUDA enabled video card to generate opti- 
mized holograms for optical trapping with a speedup of 350x (SR) and 45x 
(GSW) over the host CPU. The obtained speedup allowed us to trap and 
manipulate multiparticle 3D structures with efficient and uniform trap arrays 
in real time. Our results demonstrate that the high computational load of 
hologram generation cannot be considered any longer as a limiting factor of 
holographic trapping for real time applications. We acknowledge support from 
INFM through the Seed-project. 
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Figure 8: Frames from a movie showing the interactive micro-manipulation of 8 silica beads 
with 2 fj,m diameter (watch the full movie in supplementary online material). The beads are 
arranged on the vertices of a 5 /im side cube which is then rigidly rotated. Bottom row shows 
the corresponding frames (holograms) displayed on the SLM. 
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